Using Volunteers to Annotate Biomedical Corpora for Anaphora Resolution
Abstract
The long-term goal of this project is to build an annotated corpus of biomedical text, to be used as a foundation for the development of automated anaphora resolution systems. We plan to explore the feasibility of using a community of volunteers to annotate a corpus drawn from publicly available biomedical literature. We present issues in creating such a community and discuss results obtained from a pilot study.

A motivational example

In the past several years, research in machine processing of natural language has seen a resurgence of interest in issues involving anaphora resolution, the process of determining the intended referent of phrases whose interpretation depends on prior linguistic context. For example, consider the following text:

The prefrontal (PF) cortex has been implicated in the remarkable ability of primates to form and rearrange arbitrary associations rapidly. This ability was studied in two monkeys, using a task that required them to learn to make specific saccades in response to particular cues and then repeatedly reverse these responses.

This text contains three anaphoric expressions: “this ability”, “them”, and “these responses”. One of these (“them”) is a pronominal anaphor and the other two are full noun phrase anaphors whose anaphoric function is signaled by the determiner “this/these”. The first two anaphoric expressions, “this ability” and “them”, must be interpreted as having the same reference, respectively, as “the remarkable ability of primates to form and rearrange arbitrary associations rapidly” and “two monkeys”. This type of anaphoric expression, where there is a preceding noun phrase with the same referent (coreference), is the most common. The third example, “these responses”, is not coreferential with a preceding noun phrase.
It must be interpreted as referring to the monkeys’ responses to the cues in the task, which can be inferred from the full clause “This ability was studied in two monkeys, using a task that required them to learn to make specific saccades in response to particular cues.”

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

The problem

Interest in anaphora resolution has been driven by the need for natural language processing (NLP) systems that can handle a variety of tasks, including information retrieval, information summarization, information extraction, language generation and understanding, and machine translation of natural language. Although the area of research addressing anaphora resolution issues is far from being in its infancy, there are still a number of outstanding methodological issues that need to be addressed. One such issue is how to create corpora for developing and evaluating anaphora resolution algorithms, as there are currently no set guidelines for annotation or formatting. The need for such guidelines has been recognized by other researchers. For example, as Salmon-Alt & Romary (2004) note, “There is an opportunity to stabilize the corresponding knowledge as an international standard in the context of the recently created ISO committee TC37/SC4 on language resource management. Indeed, this committee aims at providing generic standards for the representation of linguistic data at various levels.”

Our goal is to develop a methodology and a framework for manual annotation of textual corpora for anaphoric relations, and to apply this methodology to build an annotated corpus of biomedical text. Such a corpus would serve as a foundation for the development of automated anaphora resolution systems for the purpose of extracting searchable semantic content.
This requires development of a uniform annotation system for collecting data for training, evaluating, and reliably comparing results across systems. As part of this project, we address the feasibility of using a community of volunteers to annotate a corpus drawn from the publicly available biomedical literature in the form of published papers and abstracts.
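To make the motivational example concrete, the three anaphoric expressions it discusses could be recorded as character-offset annotations over the sample text. The following is only a minimal sketch: the names `Span`, `AnaphoricLink`, `find_span`, and the labels `"coreference"` and `"inferred"` are illustrative assumptions, not the annotation scheme used in the project.

```python
from dataclasses import dataclass

# Sample passage from the motivational example.
TEXT = (
    "The prefrontal (PF) cortex has been implicated in the remarkable ability "
    "of primates to form and rearrange arbitrary associations rapidly. "
    "This ability was studied in two monkeys, using a task that required them "
    "to learn to make specific saccades in response to particular cues and "
    "then repeatedly reverse these responses."
)


@dataclass(frozen=True)
class Span:
    """A half-open character range [start, end) in the source text."""
    start: int
    end: int

    def text(self, source: str) -> str:
        return source[self.start:self.end]


@dataclass(frozen=True)
class AnaphoricLink:
    """Hypothetical record linking an anaphor to the text that licenses it."""
    anaphor: Span
    kind: str        # "pronoun" or "noun_phrase"
    relation: str    # "coreference" or "inferred" (assumed labels)
    antecedent: Span  # coreferent noun phrase, or the clause supporting the inference


def find_span(source: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase; raises ValueError if absent."""
    start = source.index(phrase)
    return Span(start, start + len(phrase))


LINKS = [
    AnaphoricLink(
        anaphor=find_span(TEXT, "This ability"),
        kind="noun_phrase",
        relation="coreference",
        antecedent=find_span(
            TEXT,
            "the remarkable ability of primates to form and rearrange "
            "arbitrary associations rapidly",
        ),
    ),
    AnaphoricLink(
        anaphor=find_span(TEXT, "them"),
        kind="pronoun",
        relation="coreference",
        antecedent=find_span(TEXT, "two monkeys"),
    ),
    AnaphoricLink(
        anaphor=find_span(TEXT, "these responses"),
        kind="noun_phrase",
        relation="inferred",
        # No coreferent noun phrase exists; point at the supporting clause.
        antecedent=find_span(
            TEXT,
            "This ability was studied in two monkeys, using a task that "
            "required them to learn to make specific saccades in response "
            "to particular cues",
        ),
    ),
]

if __name__ == "__main__":
    for link in LINKS:
        print(link.relation, "|", link.anaphor.text(TEXT), "->",
              link.antecedent.text(TEXT))
```

Storing offsets rather than copied strings keeps each annotation tied to one position in the text, which is one plausible way to let annotations from different volunteers be compared span by span.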
Similar resources
Anaphora Resolution for Biomedical Literature by Exploiting Multiple Resources
In this paper, a resolution system is presented to tackle nominal and pronominal anaphora in biomedical literature by using rich set of syntactic and semantic features. Unlike previous researches, the verification of semantic association between anaphors and their antecedents is facilitated by exploiting more outer resources, including UMLS, WordNet, GENIA Corpus 3.02p and PubMed. Moreover, the...
Challenges in Pronoun Resolution System for Biomedical Text
This paper presents our findings on the feasibility of doing pronoun resolution for biomedical texts, in comparison with conducting pronoun resolution for the newswire domain. In our experiments, we built a simple machine learning-based pronoun resolution system, and evaluated the system on three different corpora: MUC, ACE, and GENIA. Comparative statistics not only reveal the noticeable issue...
Pronominal and Sortal Anaphora Resolution for Biomedical Literature
Anaphora resolution is one of essential tasks in message understanding. In this paper resolution for pronominal and sortal anaphora, which are common in biomedical texts, is addressed. The resolution was achieved by employing UMLS ontology and SA/AO (subject-action/action-object) patterns mined from biomedical corpus. On the other hand, sortal anaphora for unknown words was tackled by using the...
Exploring Domain Differences for the Design of a Pronoun Resolution System for Biomedical Text
Much effort in the research community has been spent on solving the anaphora resolution or pronoun resolution problem, and in particular for news texts. In order to selectively inherit the previous works and solve the same problem for a new domain, we carried out a comparative study with three different corpora: MUC, ACE for the news texts, and GENIA for bio-medical papers. Our corpus analysis ...
Resolving "This-issue" Anaphora
We annotate and resolve a particular case of abstract anaphora, namely, this-issue anaphora. We propose a candidate ranking model for this-issue anaphora resolution that explores different issue-specific and general abstract-anaphora features. The model is not restricted to nominal or verbal antecedents; rather, it is able to identify antecedents that are arbitrary spans of text. Our results show...
Publication date: 2005